add python accelerator HAL for multi-vendor backends #86
theap06 wants to merge 12 commits into facebookresearch:main
Could I get some feedback? This tackles my earlier feature request on supporting other hardware.
luccabb
left a comment
adds scaffold backends for ROCm, LevelZero, TPU, and Neuron
let's just keep NVML since it's the only backend supported in this PR
this should be a full run of the cmd with the corresponding output for each
44b06c0 to 49fa293
luccabb
left a comment
gcm --backend=nvml --sink=stdout ...
healthchecks ... --backend=nvml --sink=stdout ...
this should be a full run of the cmd with the corresponding output for each
@luccabb the hardware backend and its functionality work on my end with a GPU instance. Let me know if you have any other feedback.
@luccabb could I get some feedback? I added the testing for HAL in the PR message.
@claude can you review this pr?
I'll analyze this and get back to you.
@claude can you review this pr?
Claude finished @luccabb's task in 1m 52s
PR Review: add python accelerator HAL for multi-vendor backends
Overall, this is a clean, well-structured HAL abstraction. The module layout, error hierarchy, Protocol-based interface, and test coverage are all solid. A few issues need to be addressed before merging.
Critical Issues
1. In both
def main(detach: bool, backend: str) -> None:
    _ = backend  # ← never forwarded anywhere
The option appears in
2. In
model = client.get_device_by_index(index).get_vbios_version()
Minor Issues
3. In
def close(self) -> None:
    self._client = None  # ← just drops the reference; no shutdown call
    return None
If
4. In
5. In
@dataclass
class NVMLBackend(AcceleratorBackend):
Observations / Non-blocking
Summary
The HAL design is solid and the tests are thorough. The main issue is that the
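The `close()` concern in item 3 can be illustrated with a minimal sketch. It assumes a client object that exposes a `shutdown()` method; the real client type and its teardown call are not shown in this thread, so `FakeClient`, `shutdown`, and `BackendSketch` are hypothetical names used only for illustration.

```python
from typing import Optional


class FakeClient:
    """Hypothetical stand-in for a device-library client (not the real API)."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        # Assumed teardown hook; the real client may name this differently.
        self.is_shut_down = True


class BackendSketch:
    """Illustrative backend whose close() releases the client explicitly."""

    def __init__(self, client: Optional[FakeClient] = None) -> None:
        self._client = client

    def close(self) -> None:
        # Detach the client first so a repeated close() is a harmless no-op.
        client, self._client = self._client, None
        if client is not None:
            # Shut the client down instead of only dropping the reference.
            client.shutdown()
```

Swapping the reference out before calling `shutdown()` keeps `close()` safe to call twice, which matters if the backend is also closed from a `finally` block or a context manager.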
33f4abe to f312b33
@claude can I get feedback
@luccabb I think it should be ready for prod
Claude finished @luccabb's task in 1m 29s
PR Re-Review: add python accelerator HAL for multi-vendor backends
All five issues from the previous review have been addressed. Here's the status:
Previously Critical, Now Fixed
Previously Minor, Now Fixed
One remaining minor issue
In
if isinstance(ctx.obj, dict):
    ctx.obj["accelerator_backend"] = backend
Compare to
if not isinstance(ctx.obj, dict):
    ctx.obj = {}
ctx.obj["accelerator_backend"] = backend
If
Summary: All critical and minor issues from the prior review are fixed. The remaining item above is a small defensive programming issue in
@claude could I get some feedback on this pr?
Claude finished @luccabb's task in 1m 18s
PR Re-Review: add python accelerator HAL for multi-vendor backends
The one remaining issue from the previous review was partially addressed. Here's the current state:
Remaining Issue from Last Review: Partially Addressed
In
ctx.meta["accelerator_backend"] = backend  # ← added
if isinstance(ctx.obj, dict):
    ctx.obj["accelerator_backend"] = backend  # ← still conditional
This doesn't match the defensive pattern used in
if not isinstance(ctx.obj, dict):
    ctx.obj = {}
ctx.obj["accelerator_backend"] = backend
Additionally,
The fix is still straightforward:
Everything else looks good. All critical and minor issues from both prior reviews remain fixed. The HAL design, test coverage, and CLI wiring are solid. This is ready to merge once the
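The defensive pattern the review asks for can be sketched independently of the CLI. This is a minimal illustration under assumptions: `store_backend` is a hypothetical helper, and the context is modelled with a plain `SimpleNamespace` that only mimics the `obj` and `meta` attributes of a click-style context. The PR's actual command code is not reproduced here.

```python
from types import SimpleNamespace
from typing import Any


def store_backend(ctx: Any, backend: str) -> None:
    """Record the chosen backend on a click-style context (hypothetical helper)."""
    # ctx.meta is assumed to already be a dict.
    ctx.meta["accelerator_backend"] = backend
    # Replace a missing or non-dict ctx.obj so the write below is never skipped.
    if not isinstance(ctx.obj, dict):
        ctx.obj = {}
    ctx.obj["accelerator_backend"] = backend


# Stand-in for a context whose obj was never initialised by a parent command.
ctx = SimpleNamespace(obj=None, meta={})
store_backend(ctx, "nvml")
```

With the conditional form (`if isinstance(ctx.obj, dict): ...`), the same call would leave `ctx.obj` as `None`, and any subcommand reading `ctx.obj["accelerator_backend"]` would fail.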

Introduce a hardware-agnostic accelerator abstraction layer with normalized metrics, backend management, and runtime probing. Includes a functional NVML backend plus ROCm/LevelZero/TPU/Neuron scaffolds and dedicated HAL tests.
Adds a Python-first hardware-agnostic accelerator HAL at gcm/monitoring/accelerator.
Decouples telemetry collection from NVML-only assumptions via a common backend interface and normalized metrics.
Implements a functional NVMLBackend; adds scaffold backends for ROCm, LevelZero, TPU, and Neuron.
Implements Feature Request #74
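A rough sketch of what a common backend interface with normalized metrics might look like, for readers unfamiliar with the pattern. Only the name `AcceleratorBackend` appears in this thread; the method names (`probe`, `read_metrics`, `close`), the `DeviceMetrics` fields, `StaticBackend`, and `collect` are assumptions made for illustration and may not match the PR's code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol


@dataclass(frozen=True)
class DeviceMetrics:
    """Hypothetical normalized metrics record; field names are illustrative."""

    index: int
    utilization_pct: Optional[float] = None
    memory_used_bytes: Optional[int] = None


class AcceleratorBackend(Protocol):
    """Structural interface each vendor backend is assumed to satisfy."""

    name: str

    def probe(self) -> bool: ...

    def read_metrics(self) -> List[DeviceMetrics]: ...

    def close(self) -> None: ...


class StaticBackend:
    """Toy backend returning fixed values, used only to exercise the interface."""

    name = "static"

    def probe(self) -> bool:
        return True

    def read_metrics(self) -> List[DeviceMetrics]:
        return [DeviceMetrics(index=0, utilization_pct=12.5)]

    def close(self) -> None:
        pass


def collect(backends: Dict[str, AcceleratorBackend], name: str) -> List[DeviceMetrics]:
    """Pick a backend by name, probe it, and return its normalized metrics."""
    backend = backends[name]
    if not backend.probe():
        return []
    try:
        return backend.read_metrics()
    finally:
        # Always release the backend, even if reading metrics raised.
        backend.close()
```

Because `Protocol` uses structural typing, `StaticBackend` satisfies `AcceleratorBackend` without inheriting from it, so callers such as `collect` depend only on the shared method set rather than on any vendor library.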
Test Plan:
Ran HAL tests:
12 passed